Insights into Analogy Completion from the Biomedical Domain
Analogy completion has been a popular task in recent years for evaluating the
semantic properties of word embeddings, but the standard methodology makes a
number of assumptions about analogies that do not always hold, either in recent
benchmark datasets or when expanding into other domains. Through an analysis of
analogies in the biomedical domain, we identify three assumptions: that of a
Single Answer for any given analogy, that the pairs involved describe the Same
Relationship, and that each pair is Informative with respect to the other. We
propose modifying the standard methodology to relax these assumptions by
allowing for multiple correct answers, reporting MAP and MRR in addition to
accuracy, and using multiple example pairs. We further present BMASS, a novel
dataset for evaluating linguistic regularities in biomedical embeddings, and
demonstrate that the relationships described in the dataset pose significant
semantic challenges to current word embedding methods.
Comment: Accepted to BioNLP 2017. (10 pages)
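The evaluation change proposed in this abstract (multiple correct answers, scored with MAP and MRR alongside accuracy) can be illustrated with a minimal sketch. The function names and the toy analogy data below are hypothetical and are not taken from the paper or from BMASS; the metric definitions are the standard ones.

```python
def reciprocal_rank(ranked, gold):
    """Return 1/rank of the first correct answer, or 0.0 if none appears."""
    for rank, item in enumerate(ranked, start=1):
        if item in gold:
            return 1.0 / rank
    return 0.0


def average_precision(ranked, gold):
    """Average precision@k over the ranks k at which a correct answer appears."""
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in gold:
            hits += 1
            total += hits / rank
    return total / len(gold) if gold else 0.0


def evaluate(queries):
    """queries: list of (ranked_predictions, set_of_correct_answers) pairs."""
    n = len(queries)
    accuracy = sum(1 for r, g in queries if r and r[0] in g) / n
    mrr = sum(reciprocal_rank(r, g) for r, g in queries) / n
    mean_ap = sum(average_precision(r, g) for r, g in queries) / n
    return {"accuracy": accuracy, "mrr": mrr, "map": mean_ap}


# Hypothetical toy data: two analogies, each with two acceptable answers.
queries = [
    (["aspirin", "ibuprofen", "insulin"], {"ibuprofen", "aspirin"}),
    (["fever", "cough", "headache"], {"cough", "headache"}),
]
scores = evaluate(queries)
print(scores)
```

In this toy case the second analogy counts as a miss under accuracy, while MRR and MAP still credit the correct answers ranked second and third.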
Automated Coding of Under-Studied Medical Concept Domains: Linking Physical Activity Reports to the International Classification of Functioning, Disability, and Health
Linking clinical narratives to standardized vocabularies and coding systems
is a key component of unlocking the information in medical text for analysis.
However, many domains of medical concepts lack well-developed terminologies
that can support effective coding of medical text. We present a framework for
developing natural language processing (NLP) technologies for automated coding
of under-studied types of medical information, and demonstrate its
applicability via a case study on physical mobility function. Mobility is a
component of many health measures, from post-acute care and surgical outcomes
to chronic frailty and disability, and is coded in the International
Classification of Functioning, Disability, and Health (ICF). However, mobility
and other types of functional activity remain under-studied in medical
informatics, and neither the ICF nor commonly-used medical terminologies
capture functional status terminology in practice. We investigated two
data-driven paradigms, classification and candidate selection, to link
narrative observations of mobility to standardized ICF codes, using a dataset
of clinical narratives from physical therapy encounters. Recent advances in
language modeling and word embedding were used as features for established
machine learning models and a novel deep learning approach, achieving a macro
F-1 score of 84% on linking mobility activity reports to ICF codes. Both
classification and candidate selection approaches present distinct strengths
for automated coding in under-studied domains, and we highlight that the
combination of (i) a small annotated data set; (ii) expert definitions of codes
of interest; and (iii) a representative text corpus is sufficient to produce
high-performing automated coding systems. This study has implications for the
ongoing growth of NLP tools for a variety of specialized applications in
clinical care and research.
Comment: Updated final version, published in Frontiers in Digital Health,
https://doi.org/10.3389/fdgth.2021.620828. 34 pages (23 text + 11
references); 9 figures, 2 tables
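The candidate selection paradigm mentioned in this abstract can be pictured as scoring each candidate code against a mention and keeping the best match. The sketch below assumes, purely for illustration, that mentions and expert code definitions have already been embedded as vectors; the code labels and vectors are hypothetical placeholders, not real ICF codes or the authors' models.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_code(mention_vector, definition_vectors):
    """Return the code whose definition vector is most similar to the mention."""
    return max(
        definition_vectors,
        key=lambda code: cosine(mention_vector, definition_vectors[code]),
    )


# Hypothetical embeddings of expert definitions for three placeholder codes.
definition_vectors = {
    "code_walking": [1.0, 0.0, 0.0],
    "code_transfers": [0.0, 1.0, 0.0],
    "code_wheelchair": [0.0, 0.0, 1.0],
}
# Hypothetical embedding of a mobility activity report.
mention_vector = [0.9, 0.1, 0.0]

best = select_code(mention_vector, definition_vectors)
print(best)
```

A classification approach would instead train one model over a fixed label set; candidate selection, as sketched here, only needs a representation for each code's definition.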
Writing habits and telltale neighbors: analyzing clinical concept usage patterns with sublanguage embeddings
Natural language processing techniques are being applied to increasingly
diverse types of electronic health records, and can benefit from in-depth
understanding of the distinguishing characteristics of medical document types.
We present a method for characterizing the usage patterns of clinical concepts
among different document types, in order to capture semantic differences beyond
the lexical level. By training concept embeddings on clinical documents of
different types and measuring the differences in their nearest neighborhood
structures, we are able to measure divergences in concept usage while
correcting for noise in embedding learning. Experiments on the MIMIC-III corpus
demonstrate that our approach captures clinically-relevant differences in
concept usage and provides an intuitive way to explore semantic characteristics
of clinical document collections.
Comment: LOUHI 2019 (co-located with EMNLP)
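Comparing nearest-neighbor structure across embeddings trained on different document types can be sketched as a neighborhood overlap score. This is a simplified illustration under stated assumptions: the concept names, vectors, and the overlap measure are hypothetical, and the sketch omits the noise correction the abstract describes.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(embeddings, query, k):
    """Return the set of the k concepts most cosine-similar to `query`."""
    others = [c for c in embeddings if c != query]
    others.sort(key=lambda c: cosine(embeddings[query], embeddings[c]), reverse=True)
    return set(others[:k])


def neighborhood_overlap(emb_a, emb_b, query, k):
    """Fraction of the k nearest neighbors shared by two embedding sets."""
    shared = nearest_neighbors(emb_a, query, k) & nearest_neighbors(emb_b, query, k)
    return len(shared) / k


# Hypothetical concept embeddings learned from two document types.
emb_type_a = {"pain": [1.0, 0.0], "opioid": [0.9, 0.1],
              "fracture": [0.8, 0.3], "diet": [0.0, 1.0]}
emb_type_b = {"pain": [1.0, 0.0], "opioid": [0.9, 0.1],
              "fracture": [0.1, 1.0], "diet": [0.7, 0.4]}

overlap = neighborhood_overlap(emb_type_a, emb_type_b, "pain", k=2)
print(overlap)
```

A low overlap for a concept suggests it is used differently in the two document types; in practice one would likely average over several independently trained embedding replicates before drawing that conclusion.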
Characterizing the impact of geometric properties of word embeddings on task performance
Analysis of word embedding properties to inform their use in downstream NLP
tasks has largely been studied by assessing nearest neighbors. However,
geometric properties of the continuous feature space contribute directly to the
use of embedding features in downstream models, and are largely unexplored. We
consider four properties of word embedding geometry, namely: position relative
to the origin, distribution of features in the vector space, global pairwise
distances, and local pairwise distances. We define a sequence of
transformations to generate new embeddings that expose subsets of these
properties to downstream models and evaluate change in task performance to
understand the contribution of each property to NLP models. We transform
publicly available pretrained embeddings from three popular toolkits (word2vec,
GloVe, and FastText) and evaluate on a variety of intrinsic tasks, which model
linguistic information in the vector space, and extrinsic tasks, which use
vectors as input to machine learning models. We find that intrinsic evaluations
are highly sensitive to absolute position, while extrinsic tasks rely primarily
on local similarity. Our findings suggest that future embedding models and
post-processing techniques should focus primarily on similarity to nearby
points in vector space.
Comment: Appearing in the Third Workshop on Evaluating Vector Space
Representations for NLP (RepEval 2019). 7 pages + references
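One way to see why position relative to the origin is a distinct geometric property is that translating every vector by the same offset leaves pairwise Euclidean distances unchanged but alters cosine similarity. The sketch below is only an illustration of that fact with hypothetical two-dimensional vectors; it is not the sequence of transformations defined in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def translate(vectors, offset):
    """Shift every vector by the same offset (moves the cloud off the origin)."""
    return [[x + o for x, o in zip(v, offset)] for v in vectors]


# Two hypothetical orthogonal word vectors.
vectors = [[1.0, 0.0], [0.0, 1.0]]
shifted = translate(vectors, [10.0, 10.0])

cos_before = cosine(vectors[0], vectors[1])
cos_after = cosine(shifted[0], shifted[1])
dist_before = math.dist(vectors[0], vectors[1])
dist_after = math.dist(shifted[0], shifted[1])

print(cos_before, cos_after)    # cosine changes under translation
print(dist_before, dist_after)  # Euclidean distance does not
```

Measures built on cosine similarity therefore depend on where the origin sits, whereas distance-based neighborhoods are preserved by translation.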
Jointly Embedding Entities and Text with Distant Supervision
Learning representations for knowledge base entities and concepts is becoming
increasingly important for NLP applications. However, recent entity embedding
methods have relied on structured resources that are expensive to create for
new domains and corpora. We present a distantly-supervised method for jointly
learning embeddings of entities and text from an unannotated corpus, using only
a list of mappings between entities and surface forms. We learn embeddings from
open-domain and biomedical corpora, and compare against prior methods that rely
on human-annotated text or large knowledge graph structure. Our embeddings
capture entity similarity and relatedness better than prior work, both in
existing biomedical datasets and a new Wikipedia-based dataset that we release
to the community. Results on analogy completion and entity sense disambiguation
indicate that entities and words capture complementary information that can be
effectively combined for downstream use.
Comment: 12 pages; Accepted to 3rd Workshop on Representation Learning for NLP
(Repl4NLP 2018). Code at https://github.com/OSU-slatelab/JE
Classifying the reported ability in clinical mobility descriptions
Assessing how individuals perform different activities is key information for
modeling health states of individuals and populations. Descriptions of activity
performance in clinical free text are complex, including syntactic negation and
similarities to textual entailment tasks. We explore a variety of methods for
the novel task of classifying four types of assertions about activity
performance: Able, Unable, Unclear, and None (no information). We find that
ensembling an SVM trained with lexical features and a CNN achieves 77.9% macro
F1 score on our task, and yields nearly 80% recall on the rare Unclear and
Unable samples. Finally, we highlight several challenges in classifying
performance assertions, including capturing information about sources of
assistance, incorporating syntactic structure and negation scope, and handling
new modalities at test time. Our findings establish a strong baseline for this
novel task, and identify intriguing areas for further research.
Comment: Appearing in BioNLP 2019. 10 pages; 6 tables, 2 figures
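Macro F1, the headline metric in this abstract, averages per-class F1 so that rare classes such as Unclear and Unable weigh as much as frequent ones. The sketch below computes it for the four assertion labels; the gold and predicted labels are hypothetical toy data, not results from the paper.

```python
def macro_f1(gold, predicted, labels):
    """Unweighted mean of per-label F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the label never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(labels)


LABELS = ["Able", "Unable", "Unclear", "None"]

# Hypothetical gold and predicted assertion labels for six mentions.
gold = ["Able", "Able", "Unable", "Unclear", "None", "None"]
pred = ["Able", "Unable", "Unable", "Unclear", "None", "Able"]

score = macro_f1(gold, pred, LABELS)
print(round(score, 4))
```

Because each label contributes equally, a single error on a rare class moves the macro score far more than the same error would move overall accuracy.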
Improving Broad-Coverage Medical Entity Linking with Semantic Type Prediction and Large-Scale Datasets
Medical entity linking is the task of identifying and standardizing medical
concepts referred to in an unstructured text. Most of the existing methods
adopt a three-step approach of (1) detecting mentions, (2) generating a list of
candidate concepts, and finally (3) picking the best concept among them. In
this paper, we probe into alleviating the problem of overgeneration of
candidate concepts in the candidate generation module, the most under-studied
component of medical entity linking. For this, we present MedType, a fully
modular system that prunes out irrelevant candidate concepts based on the
predicted semantic type of an entity mention. We incorporate MedType into five
off-the-shelf toolkits for medical entity linking and demonstrate that it
consistently improves entity linking performance across several benchmark
datasets. To address the dearth of annotated training data for medical entity
linking, we present WikiMed and PubMedDS, two large-scale medical entity
linking datasets, and demonstrate that pre-training MedType on these datasets
further improves entity linking performance. We make our source code and
datasets publicly available for medical entity linking research.
Comment: 35 pages
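The pruning idea described in this abstract, dropping candidate concepts whose semantic type disagrees with the type predicted for the mention, can be sketched as a simple filter. This is not the MedType implementation: the function name, the candidate records, the placeholder identifiers, and the fallback to the unpruned list are all assumptions made for illustration.

```python
def prune_candidates(candidates, predicted_types):
    """Keep candidates whose semantic type is among the predicted types.

    Assumption: if pruning would remove every candidate, return the original
    list so that the downstream linker still has something to choose from.
    """
    kept = [c for c in candidates if c["semantic_type"] in predicted_types]
    return kept if kept else candidates


# Hypothetical candidates generated for the mention "cold".
candidates = [
    {"id": "C-001", "name": "Common cold", "semantic_type": "Disease or Syndrome"},
    {"id": "C-002", "name": "Cold temperature", "semantic_type": "Natural Phenomenon"},
    {"id": "C-003", "name": "Cold sensation", "semantic_type": "Finding"},
]

# Hypothetical output of a semantic type predictor for this mention in context.
predicted_types = {"Disease or Syndrome"}

kept = prune_candidates(candidates, predicted_types)
print([c["name"] for c in kept])
```

Because the filter sits between candidate generation and final selection, it can be placed in front of an existing linker without changing that linker's internals, which is consistent with the modular design the abstract describes.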
Robust Knowledge Graph Completion with Stacked Convolutions and a Student Re-Ranking Network
Knowledge Graph (KG) completion research usually focuses on densely connected
benchmark datasets that are not representative of real KGs. We curate two KG
datasets that include biomedical and encyclopedic knowledge and use an existing
commonsense KG dataset to explore KG completion in the more realistic setting
where dense connectivity is not guaranteed. We develop a deep convolutional
network that utilizes textual entity representations and demonstrate that our
model outperforms recent KG completion methods in this challenging setting. We
find that our model's performance improvements stem primarily from its
robustness to sparsity. We then distill the knowledge from the convolutional
network into a student network that re-ranks promising candidate entities. This
re-ranking stage leads to further improvements in performance and demonstrates
the effectiveness of entity re-ranking for KG completion.
Comment: The Joint Conference of the 59th Annual Meeting of the Association
for Computational Linguistics and the 11th International Joint Conference on
Natural Language Processing (ACL-IJCNLP 2021)
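The two-stage setup in this abstract, a first-stage ranker followed by a student network that re-ranks promising candidates, can be sketched as re-sorting only the top k entries. The abstract does not say how the two scores are combined, so this sketch assumes the student score alone orders the top k; the entity names and scores are hypothetical.

```python
def rerank_top_k(candidates, first_stage_score, student_score, k):
    """Rank by the first-stage score, then re-order only the top k by the student."""
    ranked = sorted(candidates, key=first_stage_score, reverse=True)
    head = sorted(ranked[:k], key=student_score, reverse=True)
    # Candidates below the cutoff keep their first-stage order.
    return head + ranked[k:]


# Hypothetical scores for four candidate tail entities of one query.
first_stage = {"entity_a": 0.9, "entity_b": 0.8, "entity_c": 0.7, "entity_d": 0.1}
student = {"entity_a": 0.2, "entity_b": 0.9, "entity_c": 0.5, "entity_d": 1.0}

reranked = rerank_top_k(list(first_stage), first_stage.get, student.get, k=3)
print(reranked)
```

Note that entity_d stays last despite its high student score, since only candidates inside the top k are eligible for re-ranking.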
Diving for Pearls: Indexing Mobility Information in Social Security Administration Clinical Records with a Neural Relevance Tagger
Engineering: 2nd Place (The Ohio State University Edward F. Hayes Graduate Research Forum)
A three-year embargo was granted for this item.